ABSTRACT

There are two broad approaches for differentially private data analysis.
The interactive approach aims at developing customized differentially
private algorithms for various data mining tasks. The non-interactive
approach aims at developing differentially private algorithms that can
output a synopsis of the input dataset, which can then be used to
support various data mining tasks. In this paper we study the tradeoff
of interactive vs. non-interactive approaches and propose a hybrid
approach that combines the two, using k-means clustering as an example.
In the hybrid approach to differentially private k-means clustering, one
first uses a non-interactive mechanism to publish a synopsis of the
input dataset, then applies the standard k-means clustering algorithm to
learn k cluster centroids, and finally uses an interactive approach to
further improve these cluster centroids. We analyze the error behavior
of both non-interactive and interactive approaches and use this analysis
to decide how to allocate the privacy budget between the non-interactive
step and the interactive step. Results from extensive experiments
support our analysis and demonstrate the effectiveness of our approach.

1. INTRODUCTION 

In recent years, a large and growing body of literature has investigated
differentially private data analysis. Broadly, this work can be
classified into two approaches. The interactive approach aims at
developing customized differentially private algorithms for specific
data mining tasks. One identifies the queries that need to be answered
for the data mining task, analyzes their sensitivity, and then answers
them by adding appropriate noise. The non-interactive approach aims at
computing, in a differentially private way, a synopsis of the input
dataset, which can then be used to generate a synthetic dataset or to
directly support various data mining tasks.

An intriguing question is which of the two approaches is better. Given
an input dataset D, the desired privacy parameter $\epsilon$, which we
refer to as the privacy budget, and one or more data analysis tasks,
should one use the interactive approach or the non-interactive approach?
This question is largely open. In general, the non-interactive approach
has the advantage that once a synopsis is constructed, many analysis
tasks can be conducted on the synopsis. In contrast, using the
interactive approach, one is limited to executing the interactive
algorithm just once; any additional access to the dataset would violate
differential privacy. Therefore, strictly speaking, a dataset can serve
only one analyst, and for only one task. (One could divide the privacy
budget among multiple analysts and/or multiple tasks, but then the
accuracy of each task will suffer.) On the other hand, because the
interactive approach is designed specifically for a particular data
mining task, one might expect that, under the same privacy budget, it
should be able to produce more accurate results than the non-interactive
approach.

In this paper we initiate the study of the tradeoff of interactive vs.
non-interactive approaches, using k-means clustering as the example.
Clustering analysis plays an essential role in data management tasks.
Clustering has also been used as a prime example to illustrate the
effectiveness of interactive differentially private data analysis
E El [25l m EH [33 Ho). There are three state-of-the-art interactive
algorithms. The first is the differentially private version of the Lloyd
algorithm EES), which we call DPLloyd. The second algorithm, which we
call GkM, uses the sample and aggregate framework [32, 38] and is
implemented in the GUPT system [31]. The third and most recent one,
which we call PGkM, uses PrivGene [40], a framework for differentially
private model fitting based on genetic algorithms.

To the best of our knowledge, performing k-means clustering using the
non-interactive approach has not been explicitly proposed in the
literature. In this paper, we propose to combine the following
non-interactive differentially private synopsis algorithm with k-means
clustering. The dataset is viewed as a set of points over a
d-dimensional domain, which is divided into M equal-size cells, and a
noisy count is obtained for each cell. A key decision is the choice of
the parameter M. A larger M means lower average counts per cell, so the
noisy counts are more likely to be dominated by noise. A smaller M means
larger cells, and therefore less accurate information about where the
points are. We propose a method that sets
$M = (N\epsilon/10)^{2d/(2+d)}$, where N is the dataset size. This
choice is derived by extending the analysis in [34], which aims to
minimize errors when answering rectangular range queries for
2-dimensional data, to the higher-dimensional case. We call the
resulting k-means algorithm EUGkM, where EUG stands for Extended Uniform
Grid.

We conducted extensive experimental evaluations of these algorithms on 6
external datasets and 81 datasets that we synthesized by varying the
dimension d from 2 to 10 and the number of clusters from 2 to 10. The
experimental results are quite interesting. GkM was introduced after
DPLloyd and was claimed to have an accuracy advantage over DPLloyd, and
PGkM was introduced later and compared with GkM. However, we found that
DPLloyd is the best method among these three. In the comparison of
DPLloyd and GkM in [31], DPLloyd was run with a much larger number of
iterations than necessary, and thus performed poorly. In [40], PGkM was
compared only with GkM, and not with DPLloyd. More specifically, we
found that GkM is by far the worst among all methods. Through
experimental analysis of the sources of the errors, we found that it is
possible to dramatically improve the accuracy of GkM by choosing smaller
partitions in the sample and aggregate framework. After this
improvement, GkM becomes competitive with PGkM. However, DPLloyd, the
earliest method, is clearly the best-performing algorithm among the 3
interactive algorithms. Our analysis explains why DPLloyd outperforms
PGkM. The genetic-programming-style PGkM needs more iterations to
converge. When these algorithms are made differentially private, the
privacy budget is divided among all iterations, so more iterations means
more noise is added in each iteration. Therefore, the more direct
DPLloyd outperforms PGkM.

The most intriguing results are those comparing DPLloyd with EUGkM. For
most datasets, EUGkM performs much better than DPLloyd. For a few, they
perform similarly, and for two datasets DPLloyd outperforms EUGkM.
Through further theoretical and empirical analysis, we found that while
the performance of both algorithms is greatly affected by the two key
parameters d and k, the two algorithms are affected differently.
DPLloyd scales worse when k increases, while EUGkM scales worse when d
increases. Again we use analysis to explain why this is the case.

A natural follow-up question is whether we can further improve DPLloyd.
The accuracy of DPLloyd is affected by two key factors: the number of
iterations and the choice of initial centroids. In fact, these two are
closely related. If the initially chosen centroids are very good and
close to the true centroids, one needs perhaps only one iteration to
improve them, and this reduction in the number of iterations means that
less noise is added.

This leads us to propose a novel hybrid method that combines
non-interactive EUGkM with interactive DPLloyd. We first use half the
privacy budget to run EUGkM, and then use the centroids output by EUGkM
as the initial centroids for one round of DPLloyd. Such a method,
however, may not actually outperform EUGkM, especially when the privacy
budget $\epsilon$ is small, since then one round of DPLloyd may actually
worsen the centroids. We use our error analysis formulas to determine
whether there is sufficient privacy budget for such a hybrid approach to
outperform EUGkM. We then experimentally validate the effectiveness of
the hybrid approach.

The hybrid idea is applicable to general private data analysis tasks
that require parameter tuning. In the no-privacy setting, one typically
tunes parameters by building models for several parameter values and
selecting the one that offers the best utility. Under differential
privacy, this kind of parameter tuning does not work well, since the
limited privacy budget might be over-divided by trying many different
parameters. Chaudhuri et al. It) proposed a method for private parameter
tuning that takes advantage of parallel composition. The idea is to
build private models with different parameters on separate subsets of
the dataset and evaluate the models on a validation set. The best
parameter is chosen via the exponential mechanism, with the quality
function defined by the evaluation score. However, this approach does
not scale well to a large set of candidate parameters, since each data
block may then contain very few points and therefore yield a very
inaccurate model. Our proposed hybrid approach offers a better solution.
We can first publish a private synopsis of the input data, on which we
try a large set of parameters. Then, we run the interactive private
analysis with the selected parameter on the input dataset to get the
final result.

In this paper we advance the state of the art in differentially private
data mining in several ways. First, we introduce non-interactive methods
for differentially private k-means clustering, which are highly
effective and often outperform state-of-the-art interactive methods.
Second, we extensively evaluate three interactive methods and one
non-interactive method, and analyze their strengths and weaknesses.
Third, we develop techniques to analyze the error resulting from both
DPLloyd and EUGkM. Finally, we introduce the novel concept of a hybrid
approach to differentially private data analysis, which is so far the
best approach to k-means clustering. We conjecture that the concept of a
hybrid differential privacy approach may prove useful in other analysis
tasks as well.

The rest of the paper is organized as follows. In Section 2 we discuss
related work. In Section 3 we give preliminary information about
differential privacy and k-means clustering. In Section 4 we describe
the three existing interactive approaches, DPLloyd, GkM, and PGkM, and
one non-interactive approach, EUGkM. In Section 5 we show experimental
results comparing the performance of the interactive and non-interactive
approaches, and analyze their strengths and weaknesses. In Section 6 we
study the error behavior of DPLloyd and EUGkM, introduce the hybrid
approach, and compare these with existing algorithms. We conclude in
Section 7.

2. RELATED WORK 

The notion of differential privacy was developed in a series of papers
(Him mm [To). Several primitives for answering a single query
differentially privately have been proposed. Dwork et al. [m introduced
the method of adding Laplacian noise scaled with the sensitivity.
McSherry and Talwar dzi introduced a more general exponential mechanism.
Nissim et al. [32] proposed adding noise proportional to local
sensitivity.

Blum et al. O proposed a sublinear query (SuLQ) database model for
interactively answering a sublinear number (in the size of the
underlying database) of count queries in a differentially private way.
The users (e.g., machine learning algorithms) issue queries and receive
responses to which Laplace noise has been added. They applied the SuLQ
framework to k-means clustering and some other machine learning
algorithms. McSherry [28] built the PINQ (Privacy INtegrated Queries)
system, a programming platform that provides several differentially
private primitives to enable data analysts to write privacy-preserving
applications. These private primitives include noisy count, noisy sum,
noisy average, and the exponential mechanism. The DPLloyd algorithm,
which we compare against in this paper, has been implemented using these
primitives. Another programming framework with differential privacy
support is Airavat, which makes programs using the MapReduce framework
differentially private [36].

Nissim et al. [32, 38] propose the sample and aggregate framework (SAF),
and use k-means clustering as a motivating application for SAF. SAF has
been implemented in the GUPT system [31], which is evaluated on k-means
clustering. This is the GkM algorithm that we compare with in this
paper. Dwork El suggested applying a geometrically decreasing privacy
budget allocation among the iterations of k-means, whereas we use an
increasing sequence. A geometrically decreasing sequence causes later
rounds to use increasingly less privacy budget, resulting in higher and
higher distortion with each new iteration. Zhang et al. [40] proposed a
general private model fitting framework based on genetic algorithms. The
PGkM approach in this paper is an instantiation of that framework for
k-means clustering.

Interactive methods for other data mining tasks have been proposed.
McSherry and Mironov [26] adapted algorithms that produce
recommendations from collective user behavior to satisfy differential
privacy. Friedman and Schuster (m made the ID3 decision tree
construction algorithm differentially private. Chaudhuri and Monteleoni
[6] proposed a differentially private logistic regression algorithm.
Zhang et al. ED introduced the functional mechanism, which perturbs an
optimization objective to satisfy differential privacy, and applied it
to linear regression and logistic regression. Differentially private
frequent itemset mining has been studied in [23]. The tradeoffs between
interactive and non-interactive approaches in these domains are
interesting future research topics.

Most non-interactive approaches aim at developing solutions to answer
histogram or range queries accurately EH |39l ED H. Dwork et al. (HI
calculate the frequency of values and release their distribution
differentially privately. With such a method, the variance of a query
result increases linearly with the query size. To address this issue,
Xiao et al. [39] propose a wavelet-based method, by which the variance
is polylogarithmic in the query size. Hay et al. ED organize the count
queries in a hierarchy, and improve the accuracy by enforcing
consistency between the noisy count of a parent node and those of its
children. Cormode et al. (D adapted standard spatial indexing
techniques, such as quadtrees and kd-trees, to decompose the data space
in a differentially private way. Qardaji et al. [34] proposed the UG and
AG methods for publishing 2-dimensional datasets. Mohammed et al. [29]
tailored the non-interactive data release for the construction of
decision trees.

Roth et al. El studied the problem of how to release synthetic data
differentially privately for any set of count queries specified in
advance. They proposed an $\epsilon$-differentially private mechanism
whose error scales only logarithmically with the number of queries being
answered. However, it is not computationally efficient
(super-polynomial in the data universe size). Subsequent work includes
(niEQlll5lE5l[l8l[T9l. A typical example is the private multiplicative
weights mechanism [20], which answers count queries interactively and
whose error also scales logarithmically with the number of queries seen
so far. Its running time is only linear in the data universe size.


3. BACKGROUND

3.1 Differential Privacy

Informally, differential privacy requires that the output of a data
analysis mechanism be approximately the same, even if any single tuple
in the input database is arbitrarily added or removed.

Definition 1 ($\epsilon$-Differential Privacy [10, 12]). A randomized
mechanism $\mathcal{A}$ gives $\epsilon$-differential privacy if for any
pair of neighboring datasets $D$ and $D'$, and any
$S \in \mathrm{Range}(\mathcal{A})$,

$$\Pr[\mathcal{A}(D) = S] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') = S].$$

In this paper we consider two datasets $D$ and $D'$ to be neighbors if
and only if either $D = D' + t$ or $D' = D + t$, where $D + t$ denotes
the dataset resulting from adding the tuple $t$ to the dataset $D$. We
use $D \simeq D'$ to denote this. This protects the privacy of any
single tuple, because adding or removing any single tuple results in
$e^{\epsilon}$-multiplicative-bounded changes in the probability
distribution of the output.

Differential privacy is composable in the sense that combining multiple
mechanisms that satisfy differential privacy for
$\epsilon_1, \ldots, \epsilon_m$ results in a mechanism that satisfies
$\epsilon$-differential privacy for $\epsilon = \sum_i \epsilon_i$.
Because of this, we refer to $\epsilon$ as the privacy budget of a
privacy-preserving data analysis task. When a task involves multiple
steps, each step uses a portion of $\epsilon$ so that the sum of these
portions is no more than $\epsilon$.

There are several approaches for designing mechanisms that satisfy
$\epsilon$-differential privacy, including the Laplace mechanism ED and
the exponential mechanism ED. The Laplace mechanism computes a function
$g$ on the dataset $D$ by adding to $g(D)$ a random noise, the magnitude
of which depends on $GS_g$, the global sensitivity or the $L_1$
sensitivity of $g$. Such a mechanism $\mathcal{A}_g$ is given below:

$$\mathcal{A}_g(D) = g(D) + \mathrm{Lap}\!\left(\frac{GS_g}{\epsilon}\right),$$

where

$$GS_g = \max_{(D, D'):\, D \simeq D'} |g(D) - g(D')|
\quad\text{and}\quad
\Pr[\mathrm{Lap}(\beta) = x] = \frac{1}{2\beta}\, e^{-|x|/\beta}.$$

In the above, $\mathrm{Lap}(\beta)$ denotes a random variable sampled
from the Laplace distribution with scale parameter $\beta$.

3.2 k-means Clustering Algorithms

The k-means clustering problem is as follows: given a d-dimensional
dataset $D = \{x^1, \ldots, x^N\}$, partition the data points in $D$
into $k$ sets $O = \{O^1, \ldots, O^k\}$ so that the Normalized
Intra-Cluster Variance (NICV) is minimized:

$$\frac{1}{N} \sum_{j=1}^{k} \sum_{x^\ell \in O^j} \|x^\ell - o^j\|^2, \tag{1}$$

where $o^j$ denotes the centroid of cluster $O^j$.

The standard k-means algorithm is Lloyd's algorithm ED. The algorithm
starts by selecting k points as the initial choices for the centroids.
The algorithm then tries to improve these centroid choices iteratively
until no improvement can be made. In each iteration, one first
partitions the data points into k clusters, with each point assigned to
the same cluster as its nearest centroid. Then, one updates each
centroid to be the center of the data points in the cluster:

$$\forall i \in [1..d]: \quad o_i^j \leftarrow \frac{\sum_{x^\ell \in O^j} x_i^\ell}{|O^j|}, \tag{2}$$

where $j = 1, 2, \ldots, k$, and $x_i^\ell$ and $o_i^j$ are the i-th
dimensions of $x^\ell$ and $o^j$, respectively. The algorithm continues
by alternating between data partition and centroid update, until it
converges.
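As a point of reference for the differentially private variants that follow, Lloyd's algorithm and the NICV objective of Section 3.2 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `nicv`, `assign`, and `lloyd`, the `max_iter` cap, and the rule that an empty cluster keeps its previous centroid are all assumptions made for the sketch.

```python
def nicv(points, centroids, assignment):
    """Normalized intra-cluster variance: mean squared distance of each
    point to the centroid of the cluster it is assigned to."""
    total = 0.0
    for p, j in zip(points, assignment):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
    return total / len(points)


def assign(points, centroids):
    """Partition step: index of the nearest centroid for every point."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def lloyd(points, centroids, max_iter=100):
    """Alternate partition and centroid update until the partition is stable."""
    centroids = [list(c) for c in centroids]
    assignment = assign(points, centroids)
    for _ in range(max_iter):
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        new_assignment = assign(points, centroids)
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment
```

Calling `lloyd(points, initial_centroids)` returns the final centroids together with the cluster index of every point, and `nicv` evaluates the clustering objective on that result.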


4. THE INTERACTIVE AND NON-INTERACTIVE APPROACHES

In this section, we describe 3 interactive approaches and 2
non-interactive approaches to differentially private k-means clustering.

4.1 Interactive Approaches 

4.1.1 DPLloyd

A differentially private version of k-means, or Lloyd's algorithm, was
first proposed by Blum et al. El and was later implemented in the PINQ
system [28], a platform for interactive privacy-preserving data
analysis. We call this the DPLloyd approach. DPLloyd differs from the
standard Lloyd algorithm in the following ways. First, Laplacian noise
is added to the iterative update step in the Lloyd algorithm. Second,
the number of iterations needs to be fixed in advance in order to decide
how much noise needs to be added in each iteration.

Each iteration requires computing the total number of points in a
cluster and, for each dimension, the sum of the coordinates of the data
points in a cluster. Let t be the number of iterations, and d be the
number of dimensions. Then, each tuple is involved in answering dt sum
queries and t count queries. To bound the sensitivity of each sum query
by a small number r, each dimension is normalized to [-r, r]. Thus, the
global sensitivity of DPLloyd is (dr + 1)t, and each query is answered
by adding Laplacian noise $\mathrm{Lap}((dr + 1)t/\epsilon)$.
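The noisy update step can be sketched as follows. This is a hedged illustration of one DPLloyd iteration, not the PINQ implementation: the names `laplace` and `dplloyd_update` are hypothetical, the Laplace sample is drawn as the difference of two exponentials, and both the clamping of the result to the domain and the fallback for a noisy count below 1 are assumptions that the text does not specify.

```python
import random


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dplloyd_update(points, assignment, k, epsilon, t, r=1.0, rng=None):
    """One noisy centroid update; points are assumed normalized to [-r, r]^d."""
    if rng is None:
        rng = random.Random()
    d = len(points[0])
    # Every count and sum query receives noise with scale (d*r + 1) * t / epsilon.
    scale = (d * r + 1) * t / epsilon
    centroids = []
    for j in range(k):
        members = [p for p, a in zip(points, assignment) if a == j]
        noisy_count = len(members) + laplace(scale, rng)
        noisy_sums = [sum(p[i] for p in members) + laplace(scale, rng)
                      for i in range(d)]
        if noisy_count < 1.0:
            # Assumption (not from the paper): re-draw the centroid uniformly.
            centroids.append([rng.uniform(-r, r) for _ in range(d)])
        else:
            # Assumption: clamp, since noise can push the ratio outside [-r, r].
            centroids.append([max(-r, min(r, s / noisy_count)) for s in noisy_sums])
    return centroids
```

Because the scale grows linearly with both t and d, doubling the number of iterations doubles the noise added to every query, which is why the choice of t matters so much in the discussion that follows.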

There are two issues that greatly impact the accuracy of DPLloyd. The
first is the number of iterations. A large number of iterations causes
too much noise to be added. A small number of iterations may be
insufficient for the algorithm to converge. In [28], the number of
iterations is set to 5, which seems to work well for many settings. The
second is the quality of the initial centroids. A poor choice of initial
centroids can result in converging to a local optimum that is far from
the global optimum, or in not converging within the given number of
iterations. While many methods for choosing the initial points have been
developed [33], these methods were developed without privacy concerns
and need access to the dataset. In [28], k points chosen uniformly at
random from the domain are used as the initial centroids. We have
observed empirically that this can perform poorly in some settings,
since some randomly chosen initial centroids are close together. We thus
introduce an improved method for choosing initial centroids that is
similar to the concept of sphere packing. Given a radius a, we randomly
generate k centroids one by one such that each new centroid is at
distance at least a from each border of the domain and at distance at
least 2a from every existing centroid. When a randomly chosen point does
not satisfy this condition, we generate another point. When we have
failed repeatedly, we conclude that the radius a is too large, and try a
smaller radius. We use binary search to find the maximal value of a for
which the process of choosing k centroids succeeds. This process is data
independent.
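The sphere-packing initialization can be sketched as below. The text does not say how many failed draws count as "failed repeatedly" or how many binary-search steps are taken, so `max_failures` and `steps` are illustrative assumptions, as are the function names and the domain [-r, r]^d.

```python
import math
import random


def try_pack(k, d, a, r, rng, max_failures=100):
    """Try to draw k centroids in [-r, r]^d that are at least `a` from every
    border and at least 2a from each other; give up after too many rejections."""
    chosen, failures = [], 0
    while len(chosen) < k:
        # Sampling from [-r + a, r - a] enforces the border condition directly.
        c = [rng.uniform(-r + a, r - a) for _ in range(d)]
        if all(math.dist(c, o) >= 2 * a for o in chosen):
            chosen.append(c)
        else:
            failures += 1
            if failures > max_failures:
                return None
    return chosen


def sphere_packing_init(k, d, r=1.0, rng=None, steps=20):
    """Binary-search the largest radius `a` for which try_pack succeeds."""
    if rng is None:
        rng = random.Random()
    lo, hi = 0.0, r                      # a = 0 always succeeds
    best = try_pack(k, d, lo, r, rng)
    for _ in range(steps):
        mid = (lo + hi) / 2
        result = try_pack(k, d, mid, r, rng)
        if result is None:
            hi = mid                     # radius too large, shrink
        else:
            lo, best = mid, result       # success, try a larger radius
    return best, lo
```

The procedure never touches the data, so it consumes no privacy budget; the returned radius is the one under which the returned centroids were actually accepted.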

4.1.2 GkM 

The k-means clustering problem was also used to motivate the sample and
aggregate framework (SAF) for satisfying differential privacy, which was
developed in [32, 38] and implemented in the GUPT system [31].

Given a dataset $D$ and a function $f$, SAF first partitions $D$ into
$\ell$ blocks, then evaluates $f$ on each block, and finally privately
aggregates the results from all blocks into a single one. Since any
single tuple in $D$ falls in one and only one block, adding one tuple
can affect at most one block's result, limiting the sensitivity of the
aggregation step. Thus one can add less noise in the final step to
satisfy differential privacy.

As far as we know, GUPT [31] is the only implementation of SAF. The
authors of [31] implemented k-means clustering and used it to illustrate
the effectiveness of GUPT. We call this algorithm GkM. Given a dataset
$D$, it first partitions $D$ into $\ell$ blocks
$D_1, D_2, \ldots, D_\ell$. Then, for each block $D_b$
($1 \le b \le \ell$), it calculates its k centroids
$o^{1,b}, o^{2,b}, \ldots, o^{k,b}$. Finally, it averages the centroids
calculated from all blocks and adds noise. Specifically, the i-th
dimension of the j-th aggregated centroid is

$$o_i^j = \frac{1}{\ell} \sum_{b=1}^{\ell} o_i^{j,b}
+ \mathrm{Lap}\!\left(\frac{2(\max_i - \min_i) \cdot k \cdot d}{\ell \cdot \epsilon}\right), \tag{3}$$

where $o_i^{j,b}$ is the i-th dimension of $o^{j,b}$, and
$[\min_i, \max_i]$ is the estimated output range of the i-th dimension.
One half of the total privacy budget is used to estimate this output
range, and the other half is used for adding Laplace noise.

We have found that the implementation downloaded from [30], which uses
Equation (3), performed poorly. Analyzing the data closely, we found
that $\min_i$ and $\max_i$ often fall outside of the data range,
especially for small $\epsilon$. We slightly modified the code to bound
$\min_i$ and $\max_i$ to be within the data domain. This does not affect
privacy and greatly improves the accuracy. In this paper we use this
fixed version.

Here a key parameter is the choice of $\ell$. Intuitively, a larger
$\ell$ results in each block being very small and unable to preserve the
cluster information, while a smaller $\ell$ results in larger noise
being added. (Note the inverse dependency on $\ell$ in Equation (3).)
Analysis in [31] suggests setting $\ell = N^{0.4}$. Our experimental
results, however, show that the resulting performance is quite poor. We
consider a variant that chooses $\ell = N/(3k)$, i.e., each block
contains 3k points, which performs much better than setting
$\ell = N^{0.4}$.

4.1.3 PGkM

PrivGene [40] is a general-purpose differentially private model fitting
framework based on genetic algorithms. Given a dataset $D$ and a
fitting-score function $f(D, \theta)$ that measures how well the
parameter $\theta$ fits the dataset $D$, the PrivGene algorithm
initializes a candidate set of possible parameters $\theta$ and
iteratively refines them by mimicking the process of natural evolution.
Specifically, in each iteration, PrivGene uses the exponential mechanism
ED to privately select from the candidate set $m'$ parameters that have
the best fitting scores, and generates a new candidate set from the $m'$
selected parameters by crossover and mutation. Crossover regards each
parameter as an $\ell$-dimensional vector. Given two parameter vectors,
it randomly selects a number $\bar{\ell}$ such that
$0 < \bar{\ell} < \ell$ and splits each vector into the first
$\bar{\ell}$ dimensions and the remaining $\ell - \bar{\ell}$ dimensions
(the lower half). Then, it swaps the lower halves of the two vectors to
generate two child vectors. These vectors are then mutated by adding
random noise to one randomly chosen dimension.

In [40], PrivGene is applied to logistic regression, SVM, and k-means
clustering. In the case of k-means clustering, the NICV formula in
Equation (1), more precisely its non-normalized version, is used as the
fitting function $f$, and the set of k cluster centroids is defined as
the parameter $\theta$. Each parameter is a vector of
$\ell = k \cdot d$ dimensions. Initially, the candidate set is populated
with 200 sets of cluster centroids randomly sampled from the data space,
each set containing exactly k centroids. Then, the algorithm runs
iteratively for $\max\{8, (xN\epsilon)/m'\}$ rounds, where $x$ and $m'$
are empirically set to $1.25 \times 10^{-3}$ and 10, respectively, and
$N$ is the dataset size.

We call the approach of applying PrivGene to k-means clustering PGkM. It
is similar to DPLloyd in that it tries to iteratively improve the
centroids. However, rather than maintaining and improving a single set
of k centroids, PGkM maintains a pool of candidates, uses selection to
improve their quality, and uses crossover and mutation to broaden the
pool. As with DPLloyd, a key parameter is the number of iterations. With
too few iterations, the algorithm may not converge. With too many
iterations, too little privacy budget is left for each iteration, and
the exponential mechanism may not be able to select good candidates.

4.2 Non-interactive Approaches 

Interactive approaches such as DPLloyd and GkM suffer from two
limitations. First, the purpose of conducting k-means clustering is
often to visualize how the data points are partitioned into clusters.
The interactive approaches, however, output only the centroids. In the
case of DPLloyd, one could also obtain the number of data points in each
cluster; however, it cannot provide more detailed information on the
shapes that the data points in the clusters take. The value of
interactive private k-means clustering is thus limited. Second, as the
privacy budget is consumed by the interactive method, one cannot perform
any other analysis on the dataset; doing so would violate differential
privacy.

Non-interactive approaches, which first generate a synopsis of a dataset
using a differentially private algorithm and then apply a k-means
clustering algorithm to the synopsis, avoid these two limitations. In
this paper, we consider the following synopsis method. Given a
d-dimensional dataset, one partitions the domain into M equal-width grid
cells, and then releases the noisy count in each cell, obtained by
adding Laplacian noise to each cell count.
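A minimal sketch of such a grid synopsis is given below, assuming the domain [-1, 1]^d, m cells per dimension (so M = m^d cells in total), and noise Lap(1/epsilon) per cell, which matches a count sensitivity of 1 under the add/remove-one-tuple neighbor definition. The function names are hypothetical, and the dense dictionary over all cells is only practical for small d.

```python
import random
from itertools import product


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_grid(points, m, epsilon, lo=-1.0, hi=1.0, rng=None):
    """Return {cell index tuple: noisy count} for an m**d grid over [lo, hi]^d."""
    if rng is None:
        rng = random.Random()
    d = len(points[0])
    width = (hi - lo) / m
    counts = {cell: 0.0 for cell in product(range(m), repeat=d)}
    for p in points:
        # Points on the upper border are placed in the last cell.
        cell = tuple(min(m - 1, int((x - lo) / width)) for x in p)
        counts[cell] += 1.0
    # Each tuple falls in exactly one cell, so every count gets Lap(1/epsilon).
    return {cell: c + laplace(1.0 / epsilon, rng) for cell, c in counts.items()}


def cell_center(cell, m, lo=-1.0, hi=1.0):
    """Center of the bounding box of a cell, used as the location of its points."""
    width = (hi - lo) / m
    return [lo + (i + 0.5) * width for i in cell]
```

Noise is added to empty cells as well; releasing only the non-empty cells would reveal where the data are.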

The synopsis released is a set of cells, each of which has a rectangular
bounding box and a (noisy) count of how many data points are in the
bounding box. The synopsis tells only how many points are in a cell, but
not the exact locations of these points. For the purpose of clustering,
we treat all points in a cell as if they were at the center of its
bounding box. In addition, the noisy counts might be negative,
non-integer, or both. A straightforward solution is to round the noisy
count of a cell to the nearest non-negative integer and replicate the
cell center as many times as the rounded count. This approach, however,
may introduce a significant systematic bias in the clustering result
when many cells in the UG synopsis are empty or close to empty and these
cells are not distributed uniformly. In this case, simply turning
negative counts into zero can produce a large number of points in those
empty areas, which can pull a centroid away from its true position. We
instead keep the noisy counts unchanged and adapt the centroid update
procedure in k-means to use each cell as a whole. Specifically, given a
cell with center c and noisy count n, its contribution to the centroid
is c x n. With this approach, within one cluster, cells with negative
noisy counts can "cancel out" the effect of other cells with positive
noise. This yields better clustering performance.
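The adapted centroid update can be sketched as a weighted mean over cells, where the weights are the raw noisy counts and may be negative. The function name is hypothetical, and the rule for a cluster whose total weight is not positive (keep the previous centroid) is an assumption, since the text does not say how that case is handled.

```python
def synopsis_centroid_update(cells, centroids):
    """cells: list of (center, noisy_count) pairs; returns updated centroids.

    Each cell is assigned to its nearest centroid and contributes
    center * noisy_count to that cluster's coordinate sums."""
    d = len(centroids[0])
    sums = [[0.0] * d for _ in centroids]
    weights = [0.0] * len(centroids)
    for center, n in cells:
        j = min(range(len(centroids)),
                key=lambda q: sum((a - b) ** 2
                                  for a, b in zip(center, centroids[q])))
        weights[j] += n
        for i in range(d):
            sums[j][i] += center[i] * n   # may be negative and cancel noise
    updated = []
    for j, old in enumerate(centroids):
        if weights[j] > 0:
            updated.append([s / weights[j] for s in sums[j]])
        else:
            # Assumption: keep the old centroid if the total weight is <= 0.
            updated.append(list(old))
    return updated
```

Iterating this update in place of the standard one gives a k-means variant that runs entirely on the published synopsis and therefore consumes no additional privacy budget.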

For this method, the key parameter is M, the number of cells. 
When M is large, the average count per cell is low, and the noise 
will have more impact. When M is small, each cell covers a large 
area, and treating all points as at the center may be inaccurate when 
the points are not uniformly distributed. We now describe two 
methods of choosing M. 

4.2.1 EUGkM 

Qardaji et al. [34] studied the effectiveness of producing
differentially private synopses of 2-dimensional datasets for answering
rectangular range counting queries (i.e., how many data points there are
in a rectangular range) with high accuracy, and suggested choosing
$M = N\epsilon/10$. We now analyze the choice of M for the
higher-dimensional case. Consider a rectangular range counting query,
and let act denote its true answer and est the answer estimated from the
synopsis. We use mean squared error (MSE) to measure the accuracy of est
with respect to act. That is,

$$\mathrm{MSE}(est) = E\left[(est - act)^2\right] = \mathrm{Var}(est) + (\mathrm{Bias}(est))^2,$$

where Var(est) is the variance of est and Bias(est) is its bias.

There are two error sources when computing est. First, Laplace noise is
added to the cell counts to satisfy differential privacy. This results
in the variance of est. Since counting the points in a cell has
sensitivity 1, Laplace noise $\mathrm{Lap}(1/\epsilon)$ is added. Thus,
each noisy count has variance $2/\epsilon^2$. Suppose that the given
counting query covers an $\alpha$ portion of the total M cells in the
data space. Then, $\mathrm{Var}(est) = 2\alpha M/\epsilon^2$. Second,
the given counting query may not fully contain the cells that fall on
the border of the query rectangle. To estimate the number of points in
the intersection between the query rectangle and the border cells, one
assumes that the data are uniformly distributed. This results in the
bias of est, which depends on the number of tuples in the border cells.
The border of the given query consists of 2d hyper-rectangles, each
being (d - 1)-dimensional. The number of cells falling on a
hyper-rectangle is on the order of $M^{(d-1)/d}$. On average, the number
of tuples in these cells is on the order of
$M^{(d-1)/d} \cdot N/M = N/M^{1/d}$. Therefore, we estimate the bias of
est with respect to one hyper-rectangle to be $\beta N/M^{1/d}$, where
$\beta > 0$ is a parameter. We thus estimate $(\mathrm{Bias}(est))^2$ to
be $2d\,(\beta N/M^{1/d})^2$.

Summing the variance and the squared bias, it follows that

$$\mathrm{MSE}(est) = \alpha \frac{2M}{\epsilon^2} + 2d \left(\beta \frac{N}{M^{1/d}}\right)^2.$$

To minimize the MSE, we set the derivative of the above equation with
respect to M to 0. This gives

$$M = \left(\frac{N\epsilon}{\theta}\right)^{\frac{2d}{2+d}}, \tag{4}$$

where $\theta = \sqrt{\alpha/(2\beta^2)}$. We name the above extended
approach EUG (extended uniform gridding approach). We use EUGkM to
denote the EUG-based k-means clustering scheme.
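To get a feel for Equation (4), the grid size can be computed directly as M = (N * epsilon / theta)^(2d / (2 + d)). In the sketch below, the default theta = 10 is an assumption; it is the value that reproduces the two-dimensional rule M = N * epsilon / 10 when d = 2. The conversion to a per-dimension cell count by rounding M^(1/d) is also an assumed choice, and both function names are hypothetical.

```python
def eug_num_cells(n, epsilon, d, theta=10.0):
    """Total number of cells M = (N * epsilon / theta) ** (2d / (2 + d))."""
    return (n * epsilon / theta) ** (2.0 * d / (2.0 + d))


def eug_cells_per_dim(n, epsilon, d, theta=10.0):
    """Cells per dimension, about M ** (1/d), and at least 1.
    The rounding rule is an assumption made for this sketch."""
    m_total = eug_num_cells(n, epsilon, d, theta)
    return max(1, round(m_total ** (1.0 / d)))
```

For N = 10,000 and epsilon = 1, this gives M = 1,000 cells for d = 2 (about 32 per dimension) but only about 3 cells per dimension for d = 10, which illustrates why the grid becomes coarse as d grows.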


5. PERFORMANCE AND ANALYSIS 

In this section, we compare and analyze the performance of the 
five methods introduced in the last section. 

5.1 Evaluation Methodology 

We experimented with six external datasets and a group of synthetically generated datasets. The first dataset is a 2D synthetic dataset S1, which is a benchmark for studying the performance of clustering schemes. S1 contains 5,000 tuples and 15 Gaussian clusters. The second is the Gowalla dataset, which contains user check-in locations from the Gowalla location-based social network, whose users share their check-in times and locations (longitude and latitude). We take all the unique locations and obtain a 2D dataset of 107,091 tuples. We set $k = 5$ for this dataset. The third dataset is a 1% sample of the road dataset drawn from the 2006 TIGER (Topologically Integrated Geographic Encoding and Referencing) dataset [5]. It contains the GPS coordinates of road intersections in the states of Washington and New Mexico. The fourth is Image, a 3D dataset with 34,112 RGB vectors. We set $k = 3$ for it. We also use the well-known Adult dataset [1]. We use its six numerical attributes and set $k = 5$. The last dataset is Lifesci. It contains 26,733 records, each consisting of the top 10 principal components for a chemistry or biology experiment. As in previous approaches [30, 40], we set $k = 3$. Table 1 summarizes the datasets. For all the datasets, we normalize the domain of each attribute to [-1.0, 1.0].

When generating the synthetic datasets, we fix the dataset size
to 10,000, and vary $k$ and $d$ from 2 to 10. For each dataset, $k$
well-separated Gaussian clusters of equal size are generated, and 30
sets of initial centroids are generated in the same way as in Section 

elu 

Implementations for DPLloyd and GkM were downloaded from [25] and [30], respectively. The source code of PGkM [40] was shared by the authors. We implemented EUGkM.

Configuration. Each algorithm outputs $k$ centroids $\mathbf{o} = \{o^1, o^2, \ldots, o^k\}$. To evaluate the quality of such an output $\mathbf{o}$, we compute the average squared distance between each data point in $D$ and its nearest centroid in $\mathbf{o}$, and call this the NICV.
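
As a concrete reading of this definition, the following minimal Python sketch computes the NICV of a set of centroids. The function names and the representation of points as tuples of floats are illustrative assumptions, not the evaluation code used in the experiments.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nicv(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    if not points or not centroids:
        raise ValueError("points and centroids must be non-empty")
    total = 0.0
    for p in points:
        total += min(squared_distance(p, c) for c in centroids)
    return total / len(points)


if __name__ == "__main__":
    data = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    print(nicv(data, [(1.0, 0.0), (10.0, 0.0)]))
```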

We note that since both DPLloyd and EUGkM use Lloyd-style iteration, they are affected by the choice of initial centroids. In addition, all algorithms add random noise at some point to satisfy differential privacy. To conduct a fair comparison, we need to carefully average out such randomness effects. GkM and PGkM do not take a set of initial centroids as input. GkM divides the input dataset into multiple blocks, and for each block invokes the standard k-means implementation from the Scipy package [37] with a different set of initial centroids to get the result, and finally aggregates the outputs for all the blocks. We run GkM and PGkM 100 times and report the average result.

Table 1: Descriptions of the Datasets.

Dataset      # of tuples   d         k         ℓ, GkM   ℓ, GkM-3K
S1           5,000         2         15        30       111
Gowalla      107,091       2         5         103      7,139
TIGER        16,281        2         2         48       2,714
Image        34,112        3         3         65       3,790
Adult-num    48,841        6         5         75       3,256
Lifesci      26,733        10        3         59       2,970
Synthetic    10,000        [2, 10]   [2, 10]   40       10000/(3k)

For DPLloyd, we generate 30 sets of initial centroids, run DPLloyd 100 times on each set of initial centroids, and report the average of the 3,000 NICV values as the final evaluation of DPLloyd.

The non-interactive approach (EUGkM) has the advantage that once a synopsis is published, one can run k-means clustering with as many sets of initial centroids as one wants and choose the result that has the best performance relative to the synopsis. In our experiments, given a synopsis, we use the same 30 sets of initial centroids as those generated for the DPLloyd method. For each set, we run clustering and output a set of $k$ centroids. Among all the 30 sets of output centroids, we select the one that has the lowest NICV relative to the synopsis rather than to the original dataset. Because this selection uses only the published synopsis, it satisfies differential privacy. We then compute the NICV of this selected set relative to the original dataset, and take it as the resulting NICV with respect to the synopsis. To deal with the randomness introduced by the process of generating the synopsis, we generate 10 different synopses and take the average of the resulting NICV.
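
The selection step can be illustrated with a short Python sketch. It assumes, hypothetically, that a synopsis is a list of `(cell_center, noisy_count)` pairs and that negative noisy counts are clamped to zero when used as weights; neither detail is stated in the text, and the function names are made up for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def synopsis_nicv(cells, centroids):
    """Count-weighted NICV of `centroids` measured on the synopsis only.

    cells: list of (cell_center, noisy_count) pairs (assumed representation).
    Negative noisy counts are clamped to zero (an assumption of this sketch).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for center, noisy_count in cells:
        weight = max(noisy_count, 0.0)
        weighted_sum += weight * min(squared_distance(center, c) for c in centroids)
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else float("inf")


def select_by_synopsis(cells, candidates):
    """Pick the candidate centroid set with the lowest NICV on the synopsis."""
    return min(candidates, key=lambda cand: synopsis_nicv(cells, cand))


if __name__ == "__main__":
    synopsis = [((0.0,), 10.0), ((1.0,), 10.0), ((0.5,), -2.0)]
    runs = [[(0.5,)], [(0.0,), (1.0,)]]
    print(select_by_synopsis(synopsis, runs))
```

Since the original data are never touched during selection, trying more candidate sets costs no additional privacy budget.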

As the baseline, we run the standard k-means algorithm over the same 30 sets of initial centroids and take the minimum NICV among all the 30 runs.

Experimental Results. Figure 1 reports the results for the 6 external datasets. For these, we vary $\epsilon$ from 0.05 to 2.0 and plot the NICV curve for the methods mentioned in Section 4. This enables us to see how these algorithms perform under different $\epsilon$.

Figure 2 reports the results for the synthetic datasets. For these, we fix $\epsilon = 1.0$ and report the difference of NICV between each approach and the baseline. This enables us to see the scalability of these algorithms when $k$ and $d$ increase.

Among the interactive approaches, DPLloyd has the best performance in most cases. Its performance is worse than that of PGkM only on the small dataset S1 when the privacy budget $\epsilon$ is smaller than 0.15. Comparing DPLloyd and EUGkM, we observe that on the four low-dimensional datasets (S1, Gowalla, TIGER and Image), EUGkM clearly outperforms DPLloyd at small $\epsilon$ values, and the gap becomes smaller as $\epsilon$ increases. However, on the two high-dimensional datasets (Adult-num and Lifesci), DPLloyd outperforms EUGkM for almost all given privacy budgets. Similar results can also be found in Figure 2. Figure 2 also exhibits the effects of the number of clusters and the number of dimensions. EUGkM's performance is more sensitive to the increase in dimension, while DPLloyd gets worse quickly as the number of clusters increases. Below we analyze these algorithms to understand why they perform in this way. In addition, we examined EUGkM's performance under different choices of $\theta$. Setting $\theta = 10$ for EUGkM works well in most cases.


5.2 The Analysis of the GkM Approach 

From Figures 1 and 2, it is clear that GkM is always much worse
than others. There are two sources of errors for GkM. One is that 
GkM is aggregating centroids computed from the subsets of data, 
and this aggregate may be inaccurate even without adding noise. 
The other is that the noise added according to Equation (0 may be 
too large. To tease out the role played by these two error sources,
Figure 3 shows the effect of varying block size from around
to N. It shows error from GkM, error from using the aggregation 
without noise (SAG), and error from adding noise computed by 
Equation[3 to the best known centroids (Noise). From the figure, 
it is clear that setting $\ell = N^{0.4}$, which corresponds to a block size of $N^{0.6}$, is far from optimal, as the error of GkM is dominated by that from the noise, and is much higher than the error due to sample and aggregation. Indeed, we observed that as the block size decreases the error of GkM keeps decreasing, until the block size gets close to $k$. It seems that even though many individual blocks result in poor centroids, aggregating these relatively poor centroids can result in highly accurate centroids. This effect is most pronounced in the TIGER dataset, which consists of two large clusters. The two centroids computed from each small block can be approximately viewed as choosing one random point from each cluster. When averaging these centroids, one gets very close to the true centroids.

This observation motivated the introduction of the GkM-3K algorithm, which fixes each block size to be $3k$. Recall that we are to select $k$ centroids from each block. As can be seen from Figures 1 and 2, GkM-3K becomes competitive with PGkM and sometimes significantly outperforms it (e.g., on TIGER and Lifesci), although it still underperforms DPLloyd.
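
The two block-count columns of Table 1 appear to be consistent with $\ell \approx N^{0.4}$ blocks for GkM and $N/(3k)$ blocks for GkM-3K. The sketch below reproduces those counts; the function names are hypothetical and rounding to the nearest integer is an assumption inferred from the table values, not a rule stated in the text.

```python
def gkm_block_count(n):
    """Number of blocks for GkM, assumed to be N ** 0.4 rounded to nearest."""
    return int(n ** 0.4 + 0.5)


def gkm3k_block_count(n, k):
    """Number of blocks for GkM-3K: block size 3k gives N / (3k) blocks."""
    return int(n / (3.0 * k) + 0.5)


if __name__ == "__main__":
    # (name, number of tuples, k) taken from Table 1.
    for name, n, k in [("S1", 5000, 15), ("Gowalla", 107091, 5), ("TIGER", 16281, 2)]:
        print(name, gkm_block_count(n), gkm3k_block_count(n, k))
```

Fixing the block size at $3k$ yields far more blocks than $N^{0.4}$ does, so each released centroid is an average over many more per-block estimates.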

5.3 The Analysis of the PGkM Approach 

PGkM is a stochastic k-means method based on genetic algorithms. A stochastic method converges to the global optimum [22]. In contrast, DPLloyd is a gradient descent method derived from the standard Lloyd's algorithm, which may only reach a local optimum. However, PGkM is still inferior to DPLloyd in Figure 1.

There are two possible reasons. First, a stochastic approach typically takes a larger number of iterations to converge. Figure 4 compares Lloyd's algorithm with Gene (i.e., the non-private version of PGkM). For Lloyd, we reuse the initial centroids generated in Section 5.1. Given a dataset, we run Lloyd on the 30 sets of initial centroids generated for the dataset, and report the average NICV. Generally, Gene overtakes Lloyd as the number of iterations increases and finally converges to the global optimum. However, Lloyd improves its performance much faster than Gene in the first few iterations, and converges to the global optimum (or a local optimum) more quickly. For example, on the Image dataset, Lloyd reaches the best baseline after three iterations, while Gene needs more than 10 iterations to achieve the same. The second reason that PGkM is inferior to DPLloyd is the low privacy budget allocated to selecting a parameter (i.e., a set of $k$ cluster centroids) from the candidate set. In each iteration PGkM selects 10 parameters, and the total number of iterations is at least 8. Thus, the privacy budget allocated to selecting a single parameter is at most $\epsilon/80$. Therefore, PGkM has reasonable performance only for large $\epsilon$ values.

6. THE HYBRID APPROACH 

Experimental results in Section 5 establish that DPLloyd is the best performing interactive method; however, it still underperforms EUGkM. Recall that EUGkM publishes a private synopsis of the dataset, and thus enables other analyses to be performed on the dataset, beyond k-means. This means that currently the non-interactive method has a clear advantage over interactive methods, at least for k-means clustering.

Figure 1: The comparison of DPLloyd, EUGkM, PGkM and GkM. x-axis: privacy budget $\epsilon$ in log-scale; y-axis: NICV in log-scale. Legend: NoPrivacy, EUGkM, DPLloyd, GkM, GkM-3K, PGkM. Panels: (a) S1 [d = 2, k = 15]; (b) Image [d = 3, k = 3]; (c) Gowalla [d = 2, k = 5]; (d) Adult-num [d = 6, k = 5]; (e) TIGER [d = 2, k = 2]; (f) Lifesci [d = 10, k = 3].

An intriguing question is whether EUGkM is the best we can do for k-means clustering. In particular, can we further improve DPLloyd? Recall that there are two key issues that greatly affect the accuracy of DPLloyd: the number of iterations and the choice of initial centroids. In fact, these two are closely related. If the initially chosen centroids are very good and close to the true centroids, one perhaps needs only one more iteration to improve them, and this reduction in the number of iterations means that less noise is added. If we had a method to choose really good centroids in a differentially private way, we could use part (e.g., half) of the privacy budget to get those initial centroids, and the remaining privacy budget to run one iteration of DPLloyd to further improve them.

In fact, we do have such a method: EUGkM. This leads us to propose a hybrid method that combines non-interactive EUGkM with interactive DPLloyd. We first use half the privacy budget to run EUGkM, and then use the centroids output by EUGkM as the initial centroids for one round of DPLloyd. Such a method, however, may not actually outperform EUGkM, especially when the privacy budget $\epsilon$ is small, since then one round of DPLloyd may actually worsen the centroids. Therefore, when $\epsilon$ is small, we should stick to the EUGkM method, and only when $\epsilon$ is large enough should we adopt the EUGkM+DPLloyd approach. In order to determine what $\epsilon$ is large enough, we analyze how the errors depend on the various parameters in DPLloyd and in EUGkM.

Figure 2: The heatmap by varying $k$ and $d$.

6.1 Error Study of DPLloyd 

DPLloyd adds noise in each iteration of updating centroids. To study the error behavior of DPLloyd due to the injected Laplace noise, we focus on analyzing the mean squared error (MSE) between noisy centroids and true centroids in one iteration.

Consider one centroid and its update in one iteration. The true centroid's $i$-th dimension should be $o_i = \frac{S_i}{C}$, where $C$ is the number of data points in the cluster and $S_i$ is the sum of the $i$-th dimension coordinates of data points in the cluster. Consider the noisy centroid $\hat{o}$; its $i$-th dimension is $\hat{o}_i = \frac{S_i + \Delta S_i}{C + \Delta C}$, where $\Delta C$ is the noise added to the count and $\Delta S_i$ is the noise added to $S_i$. The MSE is thus:

$$\mathrm{MSE}(\hat{o}) = \mathrm{E}\left[\sum_{i=1}^{d}\left(\frac{S_i + \Delta S_i}{C + \Delta C} - \frac{S_i}{C}\right)^2\right] \quad (5)$$

Derivation based on the above formula gives the following proposition.
Figure 3: The analysis of the GkM approach. x-axis: block size exponent in log-scale; y-axis: NICV in log-scale. Panels: (a) S1 [d = 2, k = 15]; (c) TIGER [d = 2, k = 2].

Figure 4: The comparison of the convergence rate of the genetic algorithm based k-means and Lloyd's algorithm. x-axis: number of iterations in log-scale; y-axis: NICV in log-scale.


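Proposition 1. In one round of DPLloyd, the MSE is $\Theta\left(\frac{k^2 t^2 d^3}{(N\epsilon)^2}\right)$.

Proof. Let us first consider the MSE on the $i$-th dimension. Approximating $C + \Delta C$ by $C$ in the denominator, we have

$$
\begin{aligned}
\mathrm{MSE}(\hat{o}_i) &= \mathrm{E}\left[\left(\frac{S_i + \Delta S_i}{C + \Delta C} - \frac{S_i}{C}\right)^2\right] \approx \mathrm{E}\left[\frac{(C\Delta S_i - S_i\Delta C)^2}{C^4}\right] \\
&= \frac{\mathrm{E}[(\Delta S_i)^2]}{C^2} + \frac{S_i^2\,\mathrm{E}[(\Delta C)^2]}{C^4} - \frac{2CS_i\,\mathrm{E}[\Delta S_i \Delta C]}{C^4} \\
&= \frac{\mathrm{Var}(\Delta S_i)}{C^2} + \frac{S_i^2\,\mathrm{Var}(\Delta C)}{C^4},
\end{aligned}
$$

where $\mathrm{Var}(\Delta S_i)$ and $\mathrm{Var}(\Delta C)$ are the variances of $\Delta S_i$ and $\Delta C$, respectively. The last step holds because $\Delta S_i$ and $\Delta C$ are independent zero-mean Laplacian noises and the following formulas hold:

$$
\begin{cases}
\mathrm{E}[\Delta S_i \Delta C] = 0 \\
\mathrm{E}[(\Delta S_i)^2] = \mathrm{E}[(\Delta S_i)^2] - (\mathrm{E}[\Delta S_i])^2 = \mathrm{Var}(\Delta S_i) \\
\mathrm{E}[(\Delta C)^2] = \mathrm{E}[(\Delta C)^2] - (\mathrm{E}[\Delta C])^2 = \mathrm{Var}(\Delta C)
\end{cases}
$$

Suppose that on average $\frac{S_i}{2r \cdot C} = \rho$, where $[-r, r]$ is the range of the $i$-th dimension. That is, $\rho$ is the normalized coordinate of the $i$-th dimension of the cluster's centroid. Furthermore, suppose that each cluster is about the same size, i.e., $C \approx \frac{N}{k}$. Then, $\mathrm{MSE}(\hat{o}_i)$ can be approximated as follows:

$$\mathrm{MSE}(\hat{o}_i) \approx \frac{k^2}{N^2}\left(\mathrm{Var}(\Delta S_i) + (2\rho r)^2\,\mathrm{Var}(\Delta C)\right) \quad (6)$$

DPLloyd adds to each sum/count function Laplace noise $\mathrm{Lap}\left(\frac{(dr+1)t}{\epsilon}\right)$. Therefore, both $\mathrm{Var}(\Delta S_i)$ and $\mathrm{Var}(\Delta C)$ are equal to $\frac{2(dr+1)^2t^2}{\epsilon^2}$. From Equation 6 we obtain

$$\mathrm{MSE}(\hat{o}_i) \approx \frac{k^2}{N^2}\left(\mathrm{Var}(\Delta S_i) + (2\rho r)^2\,\mathrm{Var}(\Delta C)\right) = 2\left(1 + (2\rho r)^2\right)\left(\frac{kt(dr+1)}{N\epsilon}\right)^2.$$

As the noise added to each dimension is independent, from Equation 5 we know that the MSE is

$$\mathrm{MSE}(\hat{o}) = \sum_{i=1}^{d}\mathrm{MSE}(\hat{o}_i) \approx 2d\left(1 + (2\rho r)^2\right)\left(\frac{kt(dr+1)}{N\epsilon}\right)^2 \quad (7)$$

When $r$ is a small constant, this becomes $\Theta\left(\frac{k^2t^2d^3}{(N\epsilon)^2}\right)$. □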
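
To make the scaling in Equation 7 tangible, the following sketch evaluates the approximate MSE for given parameters. The function name is hypothetical, and the value of `rho` used in the example is purely illustrative, since $\rho$ depends on the data.

```python
def dplloyd_mse(n, eps, d, k, t, r, rho):
    """Approximate MSE of a DPLloyd centroid update (Equation 7).

    n: dataset size, eps: total privacy budget, d: dimensions,
    k: clusters, t: number of iterations sharing the budget,
    r: each attribute ranges over [-r, r], rho: normalized centroid coordinate.
    """
    noise_term = k * t * (d * r + 1.0) / (n * eps)
    return 2.0 * d * (1.0 + (2.0 * rho * r) ** 2) * noise_term ** 2


if __name__ == "__main__":
    # rho = 0.25 is an illustrative value only.
    full = dplloyd_mse(10000, 1.0, 2, 5, t=5, r=1.0, rho=0.25)
    single = dplloyd_mse(10000, 1.0, 2, 5, t=1, r=1.0, rho=0.25)
    print(full, single, full / single)
```

Because the budget is split across iterations, running $t$ iterations multiplies the per-iteration MSE by $t^2$ relative to a single iteration with the whole budget.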

Proposition 1 shows that the distortion to the centroid is proportional to $t^2k^2d^3$ and inversely proportional to $(N\epsilon)^2$. At first glance, this analysis seems to conflict with the experimental result in Figure 2(a), where DPLloyd is much less scalable with respect to $k$ than to $d$. The reason is that the performance of DPLloyd is also affected by the fact that 5 rounds are not enough for it to converge. When $k$ increases, convergence takes more iterations, and it is also more likely that the choice of initial centroids leads to a local optimum that is far from the global optimum.

6.2 Error Study of EUGkM 

The non-interactive approach partitions a dataset into a grid of $M$ uniform cells. Then, it releases private synopses for the cells, and runs k-means clustering on the synopses to return the cluster centroids. Similar to the error analysis for DPLloyd, we analyze the MSE. Let $o$ be the true centroid of a cluster, and $\hat{o}$ be its estimator computed by a non-interactive approach. The MSE between $\hat{o}$ and $o$ is composed of two error sources. First, the count in each cell is inaccurate after adding Laplace noise. This results in the variance (i.e., $\mathrm{Var}(\hat{o})$) of $\hat{o}$ from its expectation $\mathrm{E}[\hat{o}]$. Second, we no longer have the precise positions of data points, and only assume that they occur at the center of a cell. Thus, the expectation of $\hat{o}$ is not equal to $o$, resulting in a bias (i.e., $\mathrm{Bias}(\hat{o})$). The MSE is the combination of these two errors:

$$\mathrm{MSE}(\hat{o}) = \mathrm{Var}(\hat{o}) + (\mathrm{Bias}(\hat{o}))^2 \quad (8)$$

Analyzing the variance. We assume that each cluster has a volume that is $\frac{1}{k}$ of the total volume of the data space, and has the shape of a cube. In the $d$-dimensional case, the width of the cube is $w = \frac{2r}{\sqrt[d]{k}}$. Suppose that the $i$-th coordinate of the geometric center of the cube is $\tau_i$ (note that this is not the cluster centroid). Let $T$ be the set of cells included in the cluster. For each cell $t \in T$, we use $c_t$ to denote the number of tuples in $t$, $t_i$ to denote the $i$-th dimension coordinate of the center of cell $t$, and $\nu_t$ to denote the noise added to the cell size. Let $\hat{o}_i$ be the $i$-th dimension of the noisy centroid. Then, the variance of $\hat{o}_i$ is

$$
\begin{aligned}
\mathrm{Var}(\hat{o}_i) &= \mathrm{Var}(\hat{o}_i - \tau_i) \\
&= \mathrm{Var}\left(\frac{\sum_{t \in T} t_i (c_t + \nu_t)}{\sum_{t \in T}(c_t + \nu_t)} - \tau_i\right) \\
&= \mathrm{Var}\left(\frac{\sum_{t \in T} (t_i - \tau_i)(c_t + \nu_t)}{\sum_{t \in T}(c_t + \nu_t)}\right) \\
&\approx \frac{1}{C^2}\sum_{t \in T}(t_i - \tau_i)^2 \cdot \mathrm{Var}(c_t + \nu_t).
\end{aligned}
$$

In the above, the first step follows because $\tau_i$, as the cube geometric center, is a constant. The last step is derived by assuming $\sum_{t \in T}(c_t + \nu_t) \approx C$, that is, the noisy cluster size is approximately equal to the original cluster size $C$.

We can see that within the cube, different cells' contributions to the variance are not the same. Basically, the closer a cell is to the cube center, the smaller its contribution. The contribution is proportional to the squared distance to the cube center. We thus approximate the variance as follows:

$$\mathrm{Var}(\hat{o}_i) \approx \frac{1}{C^2}\int_{-w/2}^{w/2} x^2 \cdot \frac{M}{(2r)^d} \cdot w^{d-1} \cdot \frac{2}{\epsilon^2}\,dx = \frac{2Mr^2}{3C^2\epsilon^2 k^{\frac{d+2}{d}}}.$$

In the above integral, $x$ in the first term is the distance from a cell center to the cube center (i.e., $t_i - \tau_i$). The second term $\frac{M}{(2r)^d}$ is the number of cells per unit volume, and $w^{d-1}$ is the volume of the $(d-1)$-dimensional plane that has a distance of $x$ to the cube center. The last term $\frac{2}{\epsilon^2}$ is the variance of the cell size (i.e., $\mathrm{Var}(c_t + \nu_t)$). Suppose that clusters are of equal size, that is, $C = \frac{N}{k}$. Then, the variance of the noisy centroid, obtained by summing over all $d$ dimensions, is

$$\mathrm{Var}(\hat{o}) \approx \frac{2dMr^2k^{\frac{d-2}{d}}}{3(N\epsilon)^2} \quad (9)$$

The analysis shows that the variance of EUGkM is proportional to $\frac{M}{(N\epsilon)^2}$. EUGkM sets $M$ to $\left(\frac{N\epsilon}{10}\right)^{\frac{2d}{2+d}}$. Plugging it into Equation 9, we get that the variance of EUGkM is inversely proportional to $(N\epsilon)^{\frac{4}{2+d}}$.

Analyzing the bias. Let $x_i$ be the $i$-th dimension coordinate of a tuple $x$. Then, the bias of $\hat{o}_i$ is

$$
\begin{aligned}
\mathrm{Bias}(\hat{o}_i) &= \mathrm{E}[\hat{o}_i] - o_i \\
&= \mathrm{E}\left[\frac{\sum_{t \in T} t_i(c_t + \nu_t)}{\sum_{t \in T}(c_t + \nu_t)}\right] - \frac{\sum_{t \in T}\sum_{x \in t} x_i}{C} \\
&\approx \frac{\sum_{t \in T}\sum_{x \in t}(t_i - x_i)}{C},
\end{aligned}
$$

where the last step is developed by approximating $\sum_{t \in T}(c_t + \nu_t)$ by the cluster size $C$.

The bias developed in the above formula depends on the data distribution. Its precise estimation requires access to the real data. We thus only estimate its upper bound. Let $q_i = t_i - x_i$. The non-interactive approach partitions each dimension into $\sqrt[d]{M}$ intervals of equal length. Hence, $q_i$ falls in the range $\left[-\frac{r}{\sqrt[d]{M}}, \frac{r}{\sqrt[d]{M}}\right]$, and the upper bound of $\mathrm{Bias}(\hat{o}_i)$ is $\frac{r}{\sqrt[d]{M}}$. Summing over all $d$ dimensions, we obtain the upper bound of the squared bias of the noisy centroid:

$$(\mathrm{Bias}(\hat{o}))^2 \le \frac{dr^2}{M^{2/d}} \quad (10)$$

The estimation shows that the upper bound of the squared bias decreases as a function of $M^{2/d}$. This is consistent with expectation. As $M$ increases, the data space is partitioned into finer-grained cells. Therefore, the distance between a tuple in a cell and the cell center decreases on average.
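
The two error terms can be evaluated numerically with a small sketch such as the one below. The function names are hypothetical; the formulas follow Equation 9 and Inequality 10 as reconstructed from the derivation in this section, so they inherit its assumptions (cube-shaped, equal-size clusters).

```python
def eugkm_variance(n, eps, d, k, m, r=1.0):
    """Approximate variance of an EUGkM centroid (Equation 9)."""
    return 2.0 * d * m * r ** 2 * k ** ((d - 2.0) / d) / (3.0 * (n * eps) ** 2)


def eugkm_squared_bias_bound(d, m, r=1.0):
    """Upper bound on the squared bias of an EUGkM centroid (Inequality 10)."""
    return d * r ** 2 / m ** (2.0 / d)


if __name__ == "__main__":
    n, eps, d, k = 10000, 1.0, 2, 5
    m = (n * eps / 10.0) ** (2.0 * d / (2.0 + d))  # EUG grid size with theta = 10
    print(eugkm_variance(n, eps, d, k, m), eugkm_squared_bias_bound(d, m))
```

A finer grid (larger `m`) lowers the bias bound but raises the variance, which is the tradeoff the choice of $M$ balances.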

Comparing DPLloyd and EUGkM. We now analyze the performance of DPLloyd and EUGkM in Figure 1. Equation 7 shows that the MSE of DPLloyd is inversely proportional to $(N\epsilon)^2$. The MSE of EUGkM consists of variance and squared bias. Plugging $M = \left(\frac{N\epsilon}{10}\right)^{\frac{2d}{2+d}}$ into Equation 9 and Inequality 10, it follows that the MSE of EUGkM is inversely proportional to $(N\epsilon)^{\frac{4}{2+d}}$. This explains why the NICV of DPLloyd, which is inversely proportional to $(N\epsilon)^2$, drops much faster than that of EUGkM as $\epsilon$ grows. It also explains why DPLloyd has better performance on 'big' datasets (e.g., the TIGER dataset).

The MSE of EUGkM is inversely proportional to $(N\epsilon)^{\frac{4}{2+d}}$. Thus, it increases exponentially as a function of $d$. In contrast, from Equation 7 it follows that the MSE of DPLloyd has only cubic growth with respect to $d$. Therefore, in Figure 1, as the dimensionality of the dataset increases, DPLloyd outperforms EUGkM. This also explains why, in Figure 2, DPLloyd is more scalable with respect to $d$ than EUGkM.


6.3 The Hybrid Approach 

Our hybrid approach combines EUGkM and DPLloyd. Given a dataset and privacy budget $\epsilon$, the hybrid approach first checks whether it is expected to outperform both the DPLloyd method and the EUGkM method. If this is not the case, the hybrid approach simply falls back to EUGkM. Otherwise, the hybrid approach allocates half of the privacy budget to EUGkM to output a synopsis and find $k$ intermediate centroids that work well for the synopsis. Then, it runs DPLloyd for one iteration using the remaining half of the privacy budget to refine these $k$ centroids.
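
The refinement step can be sketched as a single noisy Lloyd update. This is a simplified illustration rather than the downloaded DPLloyd implementation: the function names are hypothetical, the Laplace scale $(dr+1)t/\epsilon$ follows the noise level used in the error analysis, and two details are assumptions of this sketch: a centroid is left unchanged when its noisy count falls below 1, and updated coordinates are clamped to $[-r, r]$.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dplloyd_one_iteration(points, centroids, eps, r=1.0, t=1, rng=None):
    """One noisy Lloyd update; `eps` is the budget for all `t` iterations."""
    rng = rng or random.Random()
    k, d = len(centroids), len(centroids[0])
    scale = (d * r + 1.0) * t / eps
    sums = [[0.0] * d for _ in range(k)]
    counts = [0.0] * k
    for p in points:
        # Assign the point to its nearest current centroid.
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1.0
        for i in range(d):
            sums[j][i] += p[i]
    updated = []
    for j in range(k):
        noisy_count = counts[j] + laplace(scale, rng)
        if noisy_count < 1.0:
            updated.append(tuple(centroids[j]))  # assumption: keep old centroid
            continue
        coords = [(sums[j][i] + laplace(scale, rng)) / noisy_count for i in range(d)]
        updated.append(tuple(max(-r, min(r, v)) for v in coords))  # assumption: clamp
    return updated


if __name__ == "__main__":
    data = [(-0.5, -0.5), (-0.3, -0.5), (0.4, 0.6), (0.6, 0.6)]
    start = [(-1.0, -1.0), (1.0, 1.0)]
    # In the hybrid approach this step would receive half of the total budget.
    print(dplloyd_one_iteration(data, start, eps=0.5, rng=random.Random(7)))
```

In the hybrid setting, `centroids` would be the intermediate centroids produced from the EUGkM synopsis, and `eps` would be the remaining half of the budget with `t=1`.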

We use MSE to heuristically determine the conditions under which the hybrid approach outperforms both the DPLloyd method and the EUGkM method. Basically, we require that the MSE of the hybrid approach be smaller than those of the other two approaches, since a smaller MSE implies a smaller error in the cluster centroids. From Equation 7 it follows that the MSE of DPLloyd with the full privacy budget is

$$2d\left(1 + (2\rho r)^2\right)\left(\frac{kt(dr+1)}{N\epsilon}\right)^2 \quad (11)$$

A precise estimation of the MSE of the EUGkM method requires access to the dataset, since the bias depends on the real data distribution. However, we have the approximate variance (Equation 9) by setting $M = \left(\frac{N\epsilon}{10}\right)^{\frac{2d}{2+d}}$:

$$\frac{2dr^2k^{\frac{d-2}{d}}}{3 \times 10^{\frac{2d}{2+d}}(N\epsilon)^{\frac{4}{2+d}}} \quad (12)$$

One-iteration DPLloyd with half the privacy budget outputs the final $k$ cluster centroids when it is applied in the hybrid approach. Therefore, we approximate the MSE of the hybrid approach by that of the one-iteration DPLloyd,

$$8d\left(1 + (2\rho r)^2\right)\left(\frac{k(dr+1)}{N\epsilon}\right)^2, \quad (13)$$

which is developed by setting $t = 1$ and the privacy budget to $0.5\epsilon$ in Equation 7.

Comparing Formulas 11 and 13, it follows that the MSE of the hybrid approach is lower than or equal to that of DPLloyd if

$$t \ge 2. \quad (14)$$

Variance is a lower bound of MSE. Thus, if the MSE of the hybrid approach is equal to or smaller than the variance of the EUGkM method, then the hybrid approach certainly has a lower MSE. Setting Formula 13 smaller than or equal to Formula 12 yields

$$\epsilon \ge \left(\frac{W}{Y}\right)^{\frac{2+d}{2d}}, \quad (15)$$

where

$$W = 8d\left(1 + (2\rho r)^2\right)\left(\frac{k(dr+1)}{N}\right)^2$$

and

$$Y = \frac{2dr^2k^{\frac{d-2}{d}}}{3 \times 10^{\frac{2d}{2+d}}N^{\frac{4}{2+d}}}.$$

Inequalities 14 and 15 give the conditions for applying the hybrid approach. Inequality 14 is automatically satisfied since DPLloyd runs for $t = 5$ iterations.
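
The decision rule given by Inequalities 14 and 15 can be sketched as follows. The function names are hypothetical, `theta=10.0` corresponds to the constant 10 in Formula 12, and the `rho` value in the example is only illustrative because $\rho$ depends on the data.

```python
def hybrid_epsilon_threshold(n, d, k, r, rho, theta=10.0):
    """Smallest eps satisfying Inequality 15: eps >= (W / Y) ** ((2 + d) / (2d))."""
    w = 8.0 * d * (1.0 + (2.0 * rho * r) ** 2) * (k * (d * r + 1.0) / n) ** 2
    y = (2.0 * d * r ** 2 * k ** ((d - 2.0) / d)
         / (3.0 * theta ** (2.0 * d / (2.0 + d)) * n ** (4.0 / (2.0 + d))))
    return (w / y) ** ((2.0 + d) / (2.0 * d))


def use_hybrid(n, eps, d, k, r, rho, t=5, theta=10.0):
    """True if both conditions hold; otherwise the method falls back to EUGkM."""
    return t >= 2 and eps >= hybrid_epsilon_threshold(n, d, k, r, rho, theta)


if __name__ == "__main__":
    # rho = 0.25 is an illustrative value only.
    print(hybrid_epsilon_threshold(10000, 2, 5, 1.0, 0.25))
    print(use_hybrid(10000, 4.0, 2, 5, 1.0, 0.25))
    print(use_hybrid(10000, 0.1, 2, 5, 1.0, 0.25))
```

Under these formulas the threshold shrinks as $N$ grows, so larger datasets switch to the hybrid path at smaller privacy budgets.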


6.4 Experimental results 

We now compare the hybrid approach with EUGkM and DPLloyd. The configuration for EUGkM and DPLloyd is the same as in Section 5.1. For the hybrid approach, we run EUGkM 10 times to output 10 sets of intermediate centroids. Then we run DPLloyd 10 times on each intermediate result. We finally report the average of the 100 NICV values. Figure 5 gives the results on the six external datasets. On the low-dimensional datasets (S1, Gowalla, TIGER, and Image), the hybrid approach simply falls back to EUGkM for small $\epsilon$ values. When $\epsilon$ increases, both the hybrid approach and EUGkM converge to the baseline, with the former having slightly better performance. For example, on the Gowalla dataset for $\epsilon = 0.7$, the average NICV of the hybrid approach is 0.02172 and that of EUGkM is 0.02174.

On the higher-dimensional datasets (Adult-num and Lifesci), the hybrid approach outperforms the other two approaches in most cases. It is worse than DPLloyd only for a few small $\epsilon$ values, at which it falls back to EUGkM. There are two possible reasons. The first is that the MSE analysis assumes that datasets are well clustered and each cluster has equal size, but the real datasets are skewed. For example, the baseline approach partitions the Adult-num dataset into 5 clusters, in which the biggest cluster contains 13,894 tuples and the smallest contains 3,160 tuples. The second is that we use the variance of EUGkM as the lower bound of its MSE. Thus, it is possible that the MSE of the hybrid approach (approximated by the MSE of one-iteration DPLloyd with half the privacy budget) is larger than the variance of EUGkM, but actually smaller than its MSE. In such cases, the hybrid approach would give a lower NICV if it did not fall back to EUGkM. For example, on the Adult-num dataset for $\epsilon = 0.05$, the hybrid approach that falls back to EUGkM has an NICV of 0.370, while its NICV is 0.244 if it applies EUGkM plus one iteration of DPLloyd.

We also evaluate the approaches using the synthetic datasets generated as in Section 5.1. Figure 6 clearly shows that the hybrid approach is more scalable than EUGkM with respect to both $k$ and $d$. This confirms the effectiveness of the hybrid approach.

Figure 7 presents the runtime of DPLloyd and EUGkM on the six external datasets. We follow the same experiment configuration as in Section 5.1. As expected, the runtime of DPLloyd is much lower than that of EUGkM. This is because EUGkM has to run k-means clustering over 30 sets of initial centroids and output the centroids with the best NICV relative to the noisy synopsis. Another reason is that DPLloyd sets the number of iterations to 5, while EUGkM runs k-means clustering until convergence.

Figure 5: The comparison of the Hybrid approach with EUGkM and DPLloyd. x-axis: privacy budget $\epsilon$ in log-scale; y-axis: NICV in log-scale. Panels: (c) TIGER [d = 2, k = 2]; (d) Image [d = 3, k = 3]; (e) Adult-num [d = 6, k = 5]; (f) Lifesci [d = 10, k = 3].


Figure 6: Comparing hybrid and EUGkM by the heatmap. Axes: $d$ and $k$. Panel: (a) EUGkM.

Figure 7: Comparing running time between DPLloyd and EUGkM, $\epsilon = 0.1$. x-axis: datasets.


7. CONCLUSION AND DISCUSSIONS 

We have improved the state of the art on differentially private k-means clustering in several ways. We have introduced non-interactive methods for differentially private k-means clustering, and have extensively evaluated and analyzed three interactive methods and one non-interactive method. Our proposed EUGkM outperforms existing methods. We have also introduced the novel concept of a hybrid approach to differentially private data analysis, which is so far the best approach to k-means clustering.

Concerning the question of non-interactive versus interactive, the insights obtained from k-means clustering are as follows. The non-interactive EUGkM has a clear advantage, especially when the privacy budget $\epsilon$ is small. Considering the further advantage that non-interactive methods enable other analyses on the dataset, we would tentatively conclude that non-interactive is the winner in this comparison. We conjecture that this tradeoff will hold for many other data analysis tasks, and we plan to investigate whether it does. Also, if one's goal is to improve the accuracy of one k-means clustering task as much as possible, then hybrid approaches may be the most promising solution.